It is time!

Introduction

In this lab, we will "just" apply the Gaussian mixture model to different datasets. Try to think of this lab as a project you might work on later in your company. Put differently, do not just apply what we saw during the lectures; combine all your statistical knowledge to produce a great analysis. Good luck!

1. Italian wine

You may already have worked with this (famous) dataset. It consists of 178 wine samples for which we have physical and chemical measurements (27 in total). Although this dataset is supervised, in the sense that we also know the variety of each wine (one of Barolo, Grignolino and Barbera), we will act as if it were unsupervised and use this information only later on, to assess how well our model performs as a classifier. There is also an additional feature, the year of production, but it will be omitted as well, at least in a first analysis.

As you may have already guessed, your goal is to be able to predict the wine variety.
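
The cluster numbers returned by a mixture model are arbitrary, so comparing them with the known varieties takes a little care. The sketch below shows one possible way to do it in plain Python (the lab does not impose a language, so Python is an assumption here). The adjusted Rand index measures agreement without having to match labels, and a majority vote turns each cluster into a predicted variety. The function names and the toy labels are made up for illustration; they are not results obtained on the wine data.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_labels):
    """Agreement between two partitions, corrected for chance (1.0 = identical)."""
    n = len(true_labels)
    if n != len(cluster_labels) or n < 2:
        raise ValueError("need two label sequences of equal length >= 2")
    # Number of pairs of samples grouped together in both partitions.
    together = sum(comb(c, 2) for c in Counter(zip(true_labels, cluster_labels)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    pairs_cluster = sum(comb(c, 2) for c in Counter(cluster_labels).values())
    expected = pairs_true * pairs_cluster / comb(n, 2)
    maximum = (pairs_true + pairs_cluster) / 2
    if maximum == expected:  # degenerate case, e.g. a single group on both sides
        return 1.0
    return (together - expected) / (maximum - expected)


def majority_vote_mapping(true_labels, cluster_labels):
    """Map each cluster to the most frequent true label among its members."""
    members = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        members.setdefault(cluster, Counter())[truth] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in members.items()}


# Hypothetical toy labels, NOT taken from the wine dataset.
varieties = ["Barolo"] * 3 + ["Grignolino"] * 3 + ["Barbera"] * 2
clusters = [0, 0, 1, 1, 1, 1, 2, 2]

mapping = majority_vote_mapping(varieties, clusters)
predicted = [mapping[c] for c in clusters]
accuracy = sum(p == t for p, t in zip(predicted, varieties)) / len(varieties)
ari = adjusted_rand_index(varieties, clusters)
print(mapping, accuracy, ari)
```

Keep in mind that the majority-vote accuracy is optimistic, since the mapping itself is chosen using the true varieties.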

2. Breast cancer data

We now consider a dataset providing, for 569 patients, 30 features of the cell nuclei obtained from a digitized image of a fine needle aspirate (FNA) of a breast mass. For each patient, the cancer was diagnosed as malignant or benign. The goal is to be able to say whether a given cancer is malignant or benign.
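
As a possible starting point, here is a minimal sketch of such an analysis in Python with scikit-learn. It rests on several assumptions that are not part of the lab statement: that scikit-learn's bundled `load_breast_cancer` data is the same 569 x 30 dataset, that the features are standardized first, and that the covariance structure is chosen by BIC among two-component models. The diagnosis is used only at the end, for evaluation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy matches the dataset described above.
data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)  # put all features on the same scale
diagnosis = data.target  # scikit-learn encoding: 0 = malignant, 1 = benign

# Fit two-component mixtures with different covariance structures
# and record the BIC of each (lower is better).
bic_by_type = {}
for covariance_type in ("full", "tied", "diag", "spherical"):
    gmm = GaussianMixture(
        n_components=2,
        covariance_type=covariance_type,
        n_init=3,          # several EM restarts to reduce sensitivity to initialization
        random_state=0,
    ).fit(X)
    bic_by_type[covariance_type] = gmm.bic(X)

best_type = min(bic_by_type, key=bic_by_type.get)

# Refit the selected model and cluster the patients without using the diagnosis.
best_gmm = GaussianMixture(
    n_components=2, covariance_type=best_type, n_init=3, random_state=0
).fit(X)
clusters = best_gmm.predict(X)

# Only now compare the clusters with the known diagnosis.
ari = adjusted_rand_score(diagnosis, clusters)
print("selected covariance type:", best_type)
print("adjusted Rand index vs. diagnosis:", ari)
```

Fixing the number of components at two uses outside knowledge about the problem; in a fully unsupervised analysis you would also let BIC (or another criterion) choose the number of components.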

3. INSEE

Go to the INSEE website, where you have access to a wide range of datasets. Pick one you like (and that is relevant to what we are covering in this lecture) and do your job!